Abstract

In this Letter we present an analytic study of sampled networks in the case of some important 
shortest-path sampling models. We present analytic formulas for the probability of edge discovery 
in the case of an evolving and a static network model. We also show that the number of discovered 
edges in a finite network scales much more slowly than predicted by earlier mean field models. 
Finally, we calculate the degree distribution of sampled networks, and we demonstrate that they 
are analogous to a destroyed network obtained by randomly removing edges from the original 
network. 

PACS numbers: 64.60.aq, 89.20.Hh 
Complex networks have attracted significant interest in recent years [1, 2]. In most cases,
the entire structure of the network is unknown and one is left with statistical samples of the
original network [3, 4]. The sampling of Internet topology is one of the greatest challenges
due to its enormous size and decentralized structure. It has motivated numerous studies on the
relationship between the original and the sampled network, including the degree distribution
and the expected size of the network [8]. Recently, Internet sampling methods have
emerged that rely on the measurement tool traceroute, which returns the sequence of IP
addresses of the network nodes along the path between the measurement host and a given
destination host. An abstraction of the network discovery process consists of selecting a set
of source and target nodes and finding the shortest paths between source and destination
pairs. A node or an edge of the network is discovered if it belongs to one of those shortest
paths. The statistical properties of the discovered network have been studied extensively in
Ref. [9]. The mean-field approximation has been developed in the limit of low
source and target density $p_S p_T \ll 1$ by neglecting the correlation of different shortest paths.

In this Letter we present exact results for certain networks. A surprising new finding
is that the network discovery process is slower in these systems than predicted by the
mean-field theory. While in the mean-field approximation the number of discovered links scales
with the product of the number of the source and target nodes, the new approach predicts
a scaling only with their sum. The lower number of discovered edges is a result of the high
degree of overlap between shortest paths. Our other important finding concerns the
degree distribution of the discovered network. We will show that it is analogous to that of a
destroyed network where a fraction of the edges of the original network has been randomly
removed.

We investigate two main discovery strategies. In peer-to-peer sampling (P2P) each node is
selected simultaneously as both source and target with probability $p$. Computer applications
using the peer-to-peer principle discover the network this way, hence the name. In disjunct
sampling (DI) each node is selected as a source or as a target, but not as both, with probabilities
$p_S$ and $p_T$, respectively. This strategy is used in Internet mapping projects, where source
computers belong to the measurement infrastructure, while a large number of random addresses are
selected as targets.

We start our analysis with the discovery of a tree. The most important observation 
permitting exact calculations in this case is that an edge separates the tree into two sides. 
An edge is discovered only if the source and the target nodes reside on different sides of
the edge. Let us denote the event that a node is selected as a source or target by $S$ and
$T$, respectively. Furthermore, we denote the event that at least one source or target node
resides on the 'left' or 'right' side of the edge by $S_{L/R}$ and $T_{L/R}$, respectively. The event
that a link is discovered, $D$, provided that its two sides $L$ and $R$ are known, is clearly
$D = (S_L T_R) + (S_R T_L)$. Therefore, we can express the conditional probability
$P(D \mid L, R) = P(S_L \mid L, R)\, P(T_R \mid L, R) + P(S_R \mid L, R)\, P(T_L \mid L, R) - P(S_L T_L \mid L, R)\, P(S_R T_R \mid L, R)$.
The probabilities arising in this expression can be calculated easily:
$P(S_X \mid L, R) = 1 - P^{N_X}(\bar{S})$, $P(T_X \mid L, R) = 1 - P^{N_X}(\bar{T})$ and
$P(S_X T_X \mid L, R) = 1 - P^{N_X}(\bar{S}) - P^{N_X}(\bar{T}) + P^{N_X}(\bar{S}\bar{T})$, where
$X = L$ or $R$, $N_L$ and $N_R$ are the number of nodes on the two sides of the link, and the
overlines denote complement events.
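The inclusion-exclusion formula above can be checked numerically. The following Python sketch is an illustration rather than part of the original analysis: the function names are hypothetical, and the brute-force helper simply enumerates every assignment of roles in the DI model for a very small edge neighbourhood.

```python
from itertools import product


def edge_discovery_prob(n_left, n_right, p_not_s, p_not_t, p_not_st):
    """P(D | L, R) for one tree edge, following the inclusion-exclusion formula.

    p_not_s, p_not_t and p_not_st are the single-node probabilities of the
    complement events: not a source, not a target, neither of the two.
    """
    def side(n):
        has_s = 1.0 - p_not_s ** n
        has_t = 1.0 - p_not_t ** n
        has_both = 1.0 - p_not_s ** n - p_not_t ** n + p_not_st ** n
        return has_s, has_t, has_both

    s_l, t_l, st_l = side(n_left)
    s_r, t_r, st_r = side(n_right)
    return s_l * t_r + s_r * t_l - st_l * st_r


def brute_force_disjunct(n_left, n_right, p_s, p_t):
    """Enumerate all role assignments in DI sampling (small sizes only)."""
    weights = {"S": p_s, "T": p_t, "-": 1.0 - p_s - p_t}
    total = 0.0
    for roles in product("ST-", repeat=n_left + n_right):
        left, right = roles[:n_left], roles[n_left:]
        found = ("S" in left and "T" in right) or ("S" in right and "T" in left)
        if found:
            prob = 1.0
            for role in roles:
                prob *= weights[role]
            total += prob
    return total
```

For P2P sampling all three complement probabilities equal $1 - p$, so the call would read `edge_discovery_prob(n_left, n_right, 1 - p, 1 - p, 1 - p)`.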

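Let us consider an evolving network where each new node is attached by one new edge to a
randomly chosen node of the existing network. The structure of this network will be a tree. Since
the network is connected, the cluster sizes $N_L$ and $N_R$ must satisfy the relation $N_L + N_R = N$,
where $N$ is the size of the whole network. In the thermodynamic limit $N \to \infty$ with $N_L$ fixed,
the side $R$ contains a source and a target with probability one, and we obtain
$P(D \mid N_L) = 1 - \sigma^{N_L}$, where we have introduced $\sigma = P(\bar{S}\bar{T})$. The probability $\sigma$ in the
different sampling models is related to the source and target densities in a simple way:

$$\sigma = \begin{cases} 1 - p & \text{for P2P sampling,} \\ 1 - p_S - p_T & \text{for DI sampling,} \end{cases} \qquad (1)$$
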
where $p, p_S, p_T \in [0, 1]$ and $p_S + p_T \le 1$. If $p_S + p_T \ll 1$ in the DI sampling model, then we can
write $P(D \mid L) \approx 1 - \exp\left(-\frac{p_S + p_T}{N - N_L}\, b_e\right)$, where $b_e = N_L (N - N_L)$ is the number of shortest
paths that traverse a given link, called betweenness centrality. Compare this result with the
mean field model of [9]: $P(D_{\mathrm{m.f.}} \mid b_e) \approx 1 - \exp(-p_S p_T b_e)$.
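To get a feeling for the difference between these expressions, they can be evaluated side by side. This is a minimal sketch with hypothetical function names; the network size and densities passed to the functions are illustrative assumptions, not values taken from the measurements.

```python
import math


def exact_prob(n_left, p_s, p_t):
    """Tree result in the large-N limit: 1 - sigma^{N_L} with sigma = 1 - p_S - p_T."""
    return 1.0 - (1.0 - p_s - p_t) ** n_left


def exponential_approx(n_left, n_total, p_s, p_t):
    """Low-density form written in terms of the edge betweenness b_e."""
    b_e = n_left * (n_total - n_left)
    return 1.0 - math.exp(-(p_s + p_t) * b_e / (n_total - n_left))


def mean_field(n_left, n_total, p_s, p_t):
    """Mean-field estimate, which depends on the product p_S * p_T."""
    b_e = n_left * (n_total - n_left)
    return 1.0 - math.exp(-p_s * p_t * b_e)
```

With, say, `n_total = 10000`, `n_left = 50`, `p_s = 0.001` and `p_t = 0.002`, the exponential form stays close to the exact tree result, while the mean-field estimate is considerably larger.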



The probability of finding an arbitrary edge by traceroute probes can be given now
straightforwardly:

$$\pi_d = \sum_{N_L=1}^{\infty} P(D \mid N_L)\, P(N_L) = 1 - H_1(\sigma), \qquad (2)$$

where $H_1(z) = \sum_{N_L} P(N_L)\, z^{N_L}$ is the generating function of the cluster size distribution
$P(N_L)$.

Expression (2) has been tested on the Dorogovtsev-Mendes (DM) network growth model [10], a
generalization of the Barabási-Albert (BA) model [11], where new nodes with $m$ new links are
attached to old nodes with degree dependent probability $\Pi(k_i) \propto k_i - m + am$,
where $a > 0$. The growing tree corresponds to $m = 1$. We calculated the distribution $P(N_L)$
for this model analytically in Ref. [14]. The generating function can be expressed in terms
of hypergeometric functions,
$H_1(z) = z\,{}_2F_1(1 - \alpha, 1, 2 - \alpha; z) - \frac{1 - \alpha}{2 - \alpha}\, z\,{}_2F_1(2 - \alpha, 1, 3 - \alpha; z)$,
and $\alpha = 1/(1 + a)$. At $a = 1$ we recover the original BA preferential attachment model with
scale-free degree distribution and at $a = +\infty$ we obtain uniform attachment probability
with exponential degree distribution. In these cases $\pi_d$ can be expressed with elementary
functions [Eq. (3)].

Figure 1 shows simulations for the P2P sampling model at $\alpha = 0$ and $1/2$. The analytic
results (3), plotted with dashed lines, fit the simulation data excellently.

FIG. 1: (Color online) Discovery probability of edges $\pi_d(p)$ as the function of the measurement
node density $p$ in P2P sampling of evolving trees. Data points are averaged over 100 realizations
of $N = 10000$ node random trees with $a = 1$ and $+\infty$. Dashed lines show the analytic solution
(3) with $\sigma = 1 - p$. The inset shows the expected number of discovered edges $\langle n_d \rangle$ as the function
of the number of the measurement nodes $n \ll N$. The solid line represents (4) for P2P sampling,
whereas the dotted line shows its leading term $\langle l \rangle n/2$ with $\langle l \rangle = 9.045$ and $15.48$ for $a = 1$ and
$+\infty$, respectively.
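A simulation of this kind can be sketched in a few lines of Python for the uniform attachment case $a = +\infty$. The code is an illustrative assumption, not the simulation code behind Fig. 1: all names are hypothetical, and the closed form used for comparison relies on the standard subtree-size law $P(N_L) = 1/[N_L (N_L + 1)]$ of random recursive trees, which is not quoted in the text. Inserting this law into (2) with $\sigma = 1 - p$ gives $\pi_d = -p \ln p / (1 - p)$.

```python
import math
import random


def simulate_p2p_discovery(n_nodes, p, seed=0):
    """Fraction of edges of a uniform-attachment tree found by P2P sampling."""
    rng = random.Random(seed)
    # Node i > 0 attaches to a uniformly chosen earlier node (the a = +inf case).
    parent = [0] * n_nodes
    for i in range(1, n_nodes):
        parent[i] = rng.randrange(i)
    # Every node becomes a measurement node (source and target) with probability p.
    selected = [1 if rng.random() < p else 0 for _ in range(n_nodes)]
    total_selected = sum(selected)
    # Children always carry larger labels than their parents, so one reverse
    # sweep accumulates the number of measurement nodes in every subtree.
    below = selected[:]
    for i in range(n_nodes - 1, 0, -1):
        below[parent[i]] += below[i]
    # The edge (i, parent[i]) is discovered when both of its sides hold
    # at least one measurement node.
    found = sum(1 for i in range(1, n_nodes) if 0 < below[i] < total_selected)
    return found / (n_nodes - 1)


def pi_d_uniform(p):
    """Closed form 1 - H_1(1 - p), assuming P(N_L) = 1 / (N_L (N_L + 1))."""
    return -p * math.log(p) / (1.0 - p)
```

A single run such as `simulate_p2p_discovery(20000, 0.1)` should land close to `pi_d_uniform(0.1)`, up to statistical fluctuations and finite-size corrections; averaging over many realizations, as in the figure, reduces the scatter.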

From the point of view of the efficiency of the discovery process, it is important to
calculate how many edges can be discovered with a given number of source nodes $n_S$ and target
nodes $n_T$. For the Internet discovery the disjunct sampling model is relevant, where
$p_S + p_T = (n_S + n_T)/N = n/N = 1 - \sigma \ll 1$. The series expansion of $H_1(\sigma)$ yields
$\pi_d = 1 - \sum_{N_L} P(N_L) \left(1 - \frac{n}{N}\right)^{N_L}$. We can rearrange the series by adding and subtracting the terms
$1 - \frac{n}{N} N_L$ and averaging them separately:
$\pi_d = \frac{n}{N} \langle N_L \rangle - \sum_{N_L} P(N_L) \left[\left(1 - \frac{n}{N}\right)^{N_L} - 1 + \frac{n}{N} N_L\right]$.
Several authors have pointed out that the distribution of $b_e = N_L (N - N_L)$ follows
a universal power-law tail in trees with exponent $-2$ [12, 13, 14]. It also implies that
asymptotically $P(N_L) \approx c N_L^{-2}$ in an arbitrary tree for $N_L \gg 1$. Specifically, $c = 1 - \alpha$ in
the DM model. Using this asymptotic form we can calculate the leading behaviour in the
$N \to \infty$ limit:
$\pi_d = \frac{n}{N} \langle N_L \rangle - c\, \mathrm{Li}_2(1 - n/N) + c \frac{\pi^2}{6} - c \frac{n}{N} (\ln N + \gamma)$,
where $\mathrm{Li}_2(x)$ is the dilogarithm function and $\gamma \approx 0.5772$ is the Euler constant. For small
argument $\mathrm{Li}_2(1 - x)$ can be expanded by using Euler's reflection formula:
$\mathrm{Li}_2(1 - x) = -\mathrm{Li}_2(x) + \frac{\pi^2}{6} - \ln(x) \ln(1 - x) \approx -x + \frac{\pi^2}{6} + x \ln(x) + \dots$.
Finally we get $\pi_d = \frac{n}{N} \langle N_L \rangle + c \frac{n}{N} - c \frac{n}{N} \ln n - c \frac{n}{N} \gamma$.

FIG. 2: (Color online) Discovery probability of edges as the function of the fraction of the
measurement nodes $p$ in static networks; panel (a) ER model, panel (b) power-law model ($\gamma = 3$).
100 all-to-all samplings were averaged in $N = 10000$ size networks with average degrees
$\langle k \rangle = 0.5$, 1, 2 and 4. Solid lines show the analytic formula (6).
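The small-argument expansion of the dilogarithm used in this derivation can be verified numerically. The sketch below uses only the Python standard library; the function names are hypothetical and the truncated power series is an assumption that is adequate only for arguments in $[0, 1)$.

```python
import math


def dilog_series(z, terms=20000):
    """Li_2(z) = sum_{k>=1} z^k / k^2, truncated; adequate for 0 <= z < 1."""
    return sum(z ** k / (k * k) for k in range(1, terms + 1))


def dilog_reflection(x):
    """Li_2(1 - x) evaluated through Euler's reflection formula."""
    return (-dilog_series(x, terms=200) + math.pi ** 2 / 6
            - math.log(x) * math.log(1.0 - x))


def dilog_small_x(x):
    """Leading terms of Li_2(1 - x) for small x: -x + pi^2/6 + x ln x."""
    return -x + math.pi ** 2 / 6 + x * math.log(x)
```

For `x = 0.01` the direct series and the reflection formula agree to many digits, while the three-term expansion differs from them only at the order of $x^2 \ln x$.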

To process this further, let us express the term $\langle N_L \rangle$ more straightforwardly. The sum of $b_e$
for all edges clearly equals the total length of the shortest paths between all possible pairings
of nodes (each pair counted once): $\sum_{e \in E} b_e = \sum_{i,j \in V} l_{ij}$. Since
$\langle b \rangle = \frac{1}{N - 1} \sum_{e \in E} b_e$ and $\langle l \rangle = \binom{N}{2}^{-1} \sum_{i,j \in V} l_{ij}$,
we can write $\langle l \rangle N/2 = \langle b \rangle$. Therefore, the average branch size can be given as
$\langle N_L \rangle = \langle l \rangle /2 + \langle N_L^2 \rangle /N$, where $\langle N_L^2 \rangle /N = \frac{1}{N} \sum_{N_L = 1}^{N} N_L^2 P(N_L) = c$.
For a large, but finite network the average number of discovered edges is
$\langle n_d \rangle = (N - 1)\, \pi_d$, that is

$$\langle n_d \rangle \approx n \left( \frac{\langle l \rangle}{2} - c \ln n + 2c - c \gamma \right) \qquad (4)$$

in the limit $1 \ll n = n_S + n_T \ll N$. The above result shows that $\langle n_d \rangle$ depends on the sum
of $n_S$ and $n_T$. This is in contrast to the mean field model, which predicts that $\langle n_d \rangle$ scales
with the product of $n_S$ and $n_T$. The logarithmic term of (4) accounts for the possibility that
a new measurement node is placed at a node discovered by previous measurement nodes.
The inset of Fig. 1 displays simulation results and the formula corresponding to the P2P
sampling.

FIG. 3: (Color online) Schematic diagram of an arbitrary vertex $v$ with degree $k$ and the emerging
branches with sizes $N_1, N_2, \dots, N_k$. Shaded circles represent branches where measurement nodes
can be found. Thick lines symbolize the discovered edges of node $v$.
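The identity between the summed edge betweenness and the total shortest-path length, which underlies the relation $\langle l \rangle N/2 = \langle b \rangle$, can be confirmed on any small tree. The following sketch is an assumption-laden illustration: the tree generator (uniform attachment) and all function names are hypothetical choices made only for the check.

```python
import random
from collections import deque


def random_tree(n_nodes, seed=0):
    """Parent list of a uniform-attachment tree (illustrative choice)."""
    rng = random.Random(seed)
    return [0] + [rng.randrange(i) for i in range(1, n_nodes)]


def total_edge_betweenness(parent):
    """Sum of b_e = N_L (N - N_L) over all edges of the tree."""
    n = len(parent)
    size = [1] * n
    # Children have larger labels than parents, so a reverse sweep
    # yields the subtree size hanging below every edge.
    for i in range(n - 1, 0, -1):
        size[parent[i]] += size[i]
    return sum(size[i] * (n - size[i]) for i in range(1, n))


def total_path_length(parent):
    """Sum of shortest-path lengths over unordered node pairs, by BFS."""
    n = len(parent)
    adj = [[] for _ in range(n)]
    for i in range(1, n):
        adj[i].append(parent[i])
        adj[parent[i]].append(i)
    total = 0
    for src in range(n):
        dist = [-1] * n
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist)
    return total // 2  # every pair was counted from both endpoints
```

Dividing the first sum by $N - 1$ and the second by $N(N-1)/2$ reproduces $\langle b \rangle$ and $\langle l \rangle$, respectively.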

We continue with the analysis of a static model where nodes are randomly connected
with a prescribed degree distribution $p_k$. This 'configuration model' is a generalization of
the Erdős-Rényi (ER) model [15], where the degree distribution is Poissonian. It has been
shown in [16] that the generating function of branch sizes $H_1(z)$ satisfies the implicit equation
$H_1(z) = z\, G_0'(H_1(z)) / \langle k \rangle$, where $G_0(z) = \sum_k p_k z^k$ is the generating function of the degree
distribution. In the configuration model loops become irrelevant in the thermodynamic limit
$N \to +\infty$ and each edge is a part of a tree. Here, $N_L$ and $N_R$ are independent and the joint
probability function has a product form $P(N_L, N_R) = P(N_L) P(N_R)$. The summation in $\pi_d$
can be carried out separately for $N_L$ and $N_R$, which yields

$$\pi_d = 2 \left(1 - H_1(P(\bar{S}))\right) \left(1 - H_1(P(\bar{T}))\right) - \left(1 - H_1(P(\bar{S})) - H_1(P(\bar{T})) + H_1(P(\bar{S}\bar{T}))\right)^2. \qquad (5)$$

In the case of P2P discovery this can be reduced to

$$\pi_d = \left(1 - H_1(1 - p)\right)^2. \qquad (6)$$



This formula can be tested on the ER model, with $G_0(z) = e^{\langle k \rangle (z - 1)}$. The cluster
size distribution can be given by the Lambert W-function: $H_1(z) = -W(-\langle k \rangle e^{-\langle k \rangle} z) / \langle k \rangle$.
Simulation results are presented in Fig. 2(a). The analytic result (6) is also shown for
comparison. One can see that it is discontinuous at zero density if $\langle k \rangle > 1$, when a giant
component emerges in the network. The simulation data deviate from the analytic solution
around the discontinuity due to finite-size effects. The size of the jump is $P_0 = (1 - H_1(1))^2$,
which is precisely the probability of infinitely large branches being attached to both sides
of an edge. If $P_0$ is regarded as an order parameter, the observed phenomenon resembles a
phase transition at $\langle k \rangle = k_c = 1$.
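The ER case can be explored numerically without a Lambert W implementation. The sketch below swaps in a plain fixed-point iteration of the implicit equation $H_1 = z\, e^{\langle k \rangle (H_1 - 1)}$, started from zero so that it approaches the smallest root; this substitution and all function names are assumptions made for illustration, and convergence becomes slow close to $\langle k \rangle = 1$ at $z = 1$.

```python
import math


def h1_er(z, mean_degree, iterations=10000):
    """Smallest solution of H = z * exp(<k> (H - 1)) by fixed-point iteration."""
    h = 0.0
    for _ in range(iterations):
        h = z * math.exp(mean_degree * (h - 1.0))
    return h


def pi_d_p2p_er(p, mean_degree):
    """Eq. (6) for the ER model: (1 - H_1(1 - p))^2."""
    return (1.0 - h1_er(1.0 - p, mean_degree)) ** 2


def jump_size(mean_degree):
    """P_0 = (1 - H_1(1))^2, the limit of pi_d as p -> 0."""
    return (1.0 - h1_er(1.0, mean_degree)) ** 2
```

Below the transition, e.g. `jump_size(0.5)`, the jump vanishes, whereas `jump_size(2.0)` gives a finite value equal to the square of the giant-component fraction.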

We also generated networks with power-law degree distribution using the hidden-variable
model introduced in [17, 18, 19, 20]. Simulations are shown in Fig. 2(b) with degree exponent
$\gamma = 3$. Note that the analytic solution is discontinuous at zero density, i.e. $P_0 > 0$, for all
$\langle k \rangle > 0$. The phase transition can be observed again, since the analytic solution, and with it
$P_0$, is independent of $\langle k \rangle$ below a critical point $k_c(\gamma) = \zeta(\gamma - 1)/\zeta(\gamma)$. Indeed, data points
almost collapse at $\langle k \rangle = 0.5$ and 1, which are below $k_c(\gamma = 3) \approx 1.3684$. The phenomenon occurs
when the degree generating function $G_0(z)$ depends linearly on $\langle k \rangle$. This is characteristic of pure
power-law distributions as long as $\langle k \rangle$ is below the critical value $k_c$.
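The quoted value $k_c(\gamma = 3) \approx 1.3684$ coincides numerically with the mean degree $\zeta(\gamma - 1)/\zeta(\gamma)$ of a pure power-law distribution $p_k \propto k^{-\gamma}$, $k \ge 1$. The following sketch evaluates this ratio under that assumption; the zeta routine is a hypothetical helper built from a direct sum with an integral estimate of the tail, not a library function.

```python
def zeta(s, cutoff=100000):
    """Riemann zeta for s > 1: direct sum plus an integral estimate of the tail."""
    head = sum(k ** -s for k in range(1, cutoff + 1))
    # Euler-Maclaurin estimate of the remaining sum over k > cutoff.
    tail = cutoff ** (1.0 - s) / (s - 1.0) - 0.5 * cutoff ** -s
    return head + tail


def critical_mean_degree(gamma):
    """Mean degree of the pure power law p_k = k^(-gamma) / zeta(gamma), k >= 1."""
    return zeta(gamma - 1) / zeta(gamma)
```

Calling `critical_mean_degree(3)` reproduces the value quoted for $\gamma = 3$ to the digits given in the text.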

Now we turn our attention to the degree distribution $P_d(k')$ of the discovered nodes. In
our analysis we consider only the contribution of those shortest paths to $k'$ which traverse
a given node. We will show that $P_d(k')$ is analogous to the degree distribution of a partially
severed network obtained by random edge pruning. This duality between the sampling and
the destruction of networks is surprising, considering the striking differences between
the two processes.

Let us consider a node $v$ with original degree $k$. If every link is removed independently
with probability $p$, then $k'$, the degree of the node after random edge removal, will follow a
binomial distribution: $P(k' \mid k) = \binom{k}{k'} (1 - p)^{k'} p^{k - k'}$. Consequently,

$$P_{\mathrm{pruned}}(k') = \sum_{k = k'}^{\infty} \binom{k}{k'} (1 - p)^{k'} p^{k - k'} P_0(k). \qquad (7)$$

Regarding the sampling process, we first examine a randomly selected node of the discovered
network, $v \in V_d$, in the static model. Let us suppose that the sizes of the branches of a node with
original degree $k$ are $N_1, N_2, \dots, N_k$ (see Fig. 3). For the sake of simplicity we discuss only
the P2P sampling model, where the probability of placing a measurement node in branch $i$
is simply $1 - \sigma^{N_i}$. Since branch sizes are independent we can average over $N_i$ separately.
As a result, a measurement node can be found in any given branch with probability
$1 - H_1(\sigma)$, independently of the other branches.

FIG. 4: (Color online) The probability of discovered degree $P_d(k')$ as the function of $p$ in P2P
sampling model for $k' = 2, 3, \dots, 7$; panel (a) static ER network with $\langle k \rangle = 1$, panel (b) evolving
BA tree with degree exponent $\gamma = 3$. The original networks are N = 10^ node graphs. Data points
are averaged for 10 networks with 10 samplings in each realization. Solid lines show the analytic
solution (8) for (a) ER and (b) BA network models, respectively. Exact solution for the evolving
BA model is shown with dotted lines for comparison.

We can see from Fig. 3 that the degree of a discovered node $k'$ equals the number
of branches where measurement nodes can be found. It follows that
$P_d(k' \mid k) = \frac{1}{P(v \in V_d \mid k)} \binom{k}{k'} (1 - H_1(\sigma))^{k'} H_1^{k - k'}(\sigma)$, where $2 \le k' \le k$. The subscript of $P_d$ refers to the
probability distribution restricted to the discovered network. In order to obtain the distribution
of $k'$ one should average this probability over $P_d(k)$, the distribution of the original
degrees of the discovered nodes. This distribution can be obtained by
$P_d(k) = \frac{P(v \in V_d \mid k)\, P_0(k)}{P(v \in V_d)}$, so

$$P_d(k') = \frac{1}{P(v \in V_d)} \sum_{k = k'}^{\infty} \binom{k}{k'} (1 - H_1(\sigma))^{k'} H_1^{k - k'}(\sigma)\, P_0(k), \qquad (8)$$

where $k' \ge 2$ and $P(v \in V_d) = 1 - G_0(H_1(\sigma)) - (1 - H_1(\sigma))\, G_0'(H_1(\sigma))$. It is evident from
(7) and (8) that $P_d(k')$ equals $P_{\mathrm{pruned}}(k')$, normalized properly for $k' \ge 2$, if $p = H_1(\sigma)$. In
other words the discovered network is equivalent to an edge-destroyed one.
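The correspondence between (7) and (8) can be illustrated for a Poissonian original degree distribution. The sketch below is a hedged example with hypothetical function names; the Poisson choice for $P_0(k)$, the truncation `k_max` and the value of $H_1(\sigma)$ passed as `h` are assumptions made only for the check.

```python
import math


def poisson_pmf(k, mean):
    """Poissonian original degree distribution P_0(k)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def pruned_degree_pmf(k_new, p_remove, mean, k_max=150):
    """Eq. (7) with a Poissonian P_0(k), truncated at k_max."""
    return sum(math.comb(k, k_new) * (1 - p_remove) ** k_new
               * p_remove ** (k - k_new) * poisson_pmf(k, mean)
               for k in range(k_new, k_max + 1))


def discovered_degree_pmf(k_new, h, mean, k_max=150):
    """Eq. (8): the same sum with p -> H_1(sigma), divided by P(v in V_d)."""
    g0 = math.exp(mean * (h - 1.0))   # G_0(h) for a Poisson distribution
    g0_prime = mean * g0              # G_0'(h)
    norm = 1.0 - g0 - (1.0 - h) * g0_prime
    return pruned_degree_pmf(k_new, h, mean, k_max) / norm
```

Summing `discovered_degree_pmf` over $k' \ge 2$ gives one, and each value equals the pruned distribution at $p = H_1(\sigma)$ renormalized to $k' \ge 2$, as stated above.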

In the case of an evolving network at least one of the branches, say $N_k$, tends to infinity
as $N \to \infty$, so the probability that a measurement node can be found in the $k$th branch
tends to 1. In order to circumvent this effect let us redefine the network in such a way that
every link is directed toward the gigantic side of the network. Let $q = k - 1$ denote
the in-degree of nodes in this directed network. It is easy to see that the discovered in-degree
$q_d$ will be equal to the number of branches where measurement nodes can be found. We
can follow the same procedure as in the case of the static model. We only need to replace
$k'$ and $k$ in (8) with the corresponding in-degrees $q_d$ and $q$, and the normalization constant
with $P(v \in V_d) = 1 - \ldots$

Simulation results are shown for both static and evolving networks in Fig. 4. Note that
we have assumed above that $H_1(\sigma)$ is independent of $q$. This is only an approximation in
the case of the evolving network model. However, $H_1(\sigma \mid q)$ can be calculated exactly for
the DM model, which is shown with dotted lines in Fig. 4(b).

In conclusion we presented a study of network discovery processes. We derived analytically
the probability of finding an arbitrary link of the network via shortest-path network
discovery. We considered both static and evolving random networks with various sampling
scenarios. We also demonstrated an important duality between the discovery of networks
by shortest paths and the destruction of the same network by edge removal.